nlp_architect.models.np2vec.NP2vec

class nlp_architect.models.np2vec.NP2vec(corpus, corpus_format='txt', mark_char='_', word_embedding_type='word2vec', sg=0, size=100, window=10, alpha=0.025, min_alpha=0.0001, min_count=5, sample=1e-05, workers=20, hs=0, negative=25, cbow_mean=1, iterations=15, min_n=3, max_n=6, word_ngrams=1, prune_non_np=True)[source]

Initialize the np2vec model, train it, save it and load it.

__init__(corpus, corpus_format='txt', mark_char='_', word_embedding_type='word2vec', sg=0, size=100, window=10, alpha=0.025, min_alpha=0.0001, min_count=5, sample=1e-05, workers=20, hs=0, negative=25, cbow_mean=1, iterations=15, min_n=3, max_n=6, word_ngrams=1, prune_non_np=True)[source]

Initialize np2vec model and train it.

Parameters
  • corpus (str) – path to the corpus.

  • corpus_format (str {json,txt,conll2000}) – format of the input marked corpus; txt and json formats are supported. For json format, the file should contain an iterable of sentences. Each sentence is a list of terms to be used for training.

  • mark_char (char) – special character that marks NP’s suffix.

  • word_embedding_type (str {word2vec,fasttext}) – word embedding model type; word2vec and fasttext are supported.

  • np2vec_model_file (str) – path to the file where the trained np2vec model is to be stored.

  • binary (bool) – boolean indicating whether the model is stored in binary format; if word_embedding_type is fasttext and word_ngrams is 1, binary should be set to True.

  • sg (int {0,1}) – model training hyperparameter, skip-gram. Defines the training algorithm. If 1, skip-gram is used; otherwise, CBOW is employed.

  • size (int) – model training hyperparameter, size of the feature vectors.

  • window (int) – model training hyperparameter, maximum distance between the current and predicted word within a sentence.

  • alpha (float) – model training hyperparameter. The initial learning rate.

  • min_alpha (float) – model training hyperparameter. Learning rate will linearly drop to min_alpha as training progresses.

  • min_count (int) – model training hyperparameter, ignore all words with total frequency lower than this.

  • sample (float) – model training hyperparameter, threshold for configuring which higher-frequency words are randomly downsampled.

  • workers (int) – model training hyperparameter, number of worker threads.

  • hs (int {0,1}) – model training hyperparameter, hierarchical softmax. If set to 1, hierarchical softmax will be used for model training. If set to 0, and negative is non-zero, negative sampling will be used.

  • negative (int) – model training hyperparameter, negative sampling. If > 0, negative sampling will be used; the int for negative specifies how many "noise words" should be drawn (usually between 5-20).

  • cbow_mean (int {0,1}) – model training hyperparameter. If 0, use the sum of the context word vectors. If 1, use the mean. Only applies when CBOW is used.

  • iterations (int) – model training hyperparameter, number of iterations.

  • min_n (int) – fasttext training hyperparameter. Min length of char ngrams to be used for training word representations.

  • max_n (int) – fasttext training hyperparameter. Max length of char ngrams to be used for training word representations. Set max_n to be less than min_n to avoid char ngrams being used.

  • word_ngrams (int {0,1}) – fasttext training hyperparameter. If 1, enriches word vectors with subword (ngrams) information.

  • prune_non_np (bool) – indicates whether to prune non-NP’s after training process.
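
As an illustration of the json corpus format described above, the following sketch writes a tiny marked corpus to disk using only the standard library. The sentences, the file name `marked_corpus.json`, the helper `mark_np`, and the convention of joining an NP's words with `mark_char` before appending it as a suffix are all assumptions made for illustration; they are not taken from the library's source.

```python
import json
import os
import tempfile

MARK_CHAR = "_"  # default value of mark_char in the signature above


def mark_np(words, mark_char=MARK_CHAR):
    """Join the words of a noun phrase and append mark_char as a suffix.

    Hypothetical helper: the joining convention is an assumption.
    """
    return mark_char.join(words) + mark_char


# An iterable of sentences; each sentence is a list of terms.
sentences = [
    ["we", "study", mark_np(["word", "embeddings"]), "for", mark_np(["noun", "phrases"])],
    [mark_np(["new", "york"]), "is", "a", mark_np(["large", "city"])],
]

corpus_path = os.path.join(tempfile.mkdtemp(), "marked_corpus.json")
with open(corpus_path, "w", encoding="utf-8") as f:
    json.dump(sentences, f)

# Read the file back to confirm it round-trips as a list of term lists.
with open(corpus_path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded[1][0])
```

A file like this could then be passed as `corpus` with `corpus_format='json'`. On a corpus this small, `min_count` would presumably need to be lowered from its default of 5 for any term to survive.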

Methods

__init__(corpus[, corpus_format, mark_char, …])

Initialize np2vec model and train it.

is_marked(s)

Check if a string is marked.

load(np2vec_model_file[, binary, …])

Load the np2vec model.

save([np2vec_model_file, binary, …])

Save the np2vec model.

is_marked(s)[source]

Check if a string is marked.

Parameters

s (str) – string to check
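
The check can be pictured as a simple suffix test. The sketch below is a standalone function, not the method's actual implementation; it assumes that "marked" means the string ends with `mark_char`, which is inferred from the `mark_char` parameter description ("special character that marks NP's suffix").

```python
def is_marked(s, mark_char="_"):
    """Return True if s is non-empty and ends with mark_char.

    Hypothetical standalone version for illustration only; the
    suffix-based rule is an assumption, not the library's code.
    """
    return bool(s) and s.endswith(mark_char)


print(is_marked("new_york_"))
print(is_marked("city"))
```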

classmethod load(np2vec_model_file, binary=False, word_ngrams=0, word2vec_format=True)[source]

Load the np2vec model.

Parameters
  • np2vec_model_file (str) – the file containing the np2vec model to load

  • binary (bool) – boolean indicating whether the np2vec model to load is in binary format

  • word_ngrams (int {1,0}) – If 1, np2vec model to load uses word vectors with subword (ngrams) information.

  • word2vec_format (bool) – boolean indicating whether the model to load has been stored in original word2vec format.

Returns

the loaded np2vec model

save(np2vec_model_file='np2vec.model', binary=False, word2vec_format=True)[source]

Save the np2vec model.

Parameters
  • np2vec_model_file (str) – path to the file where the np2vec model is saved

  • binary (bool) – boolean indicating whether to save the np2vec model in binary format

  • word2vec_format (bool) – boolean indicating whether to save the model in original word2vec format.
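
For orientation, the original word2vec text format (presumably what `word2vec_format=True` with `binary=False` produces) consists of a header line with the vocabulary size and vector dimension, followed by one line per word. The sketch below writes and reads that layout with the standard library only. The helper names and the toy vectors are hypothetical, and this is not how the library itself saves or loads a model.

```python
import io


def write_word2vec_text(vectors, stream):
    """Write {word: [floats]} in the word2vec text layout."""
    dim = len(next(iter(vectors.values())))
    # Header: "<vocab_size> <vector_size>"
    stream.write(f"{len(vectors)} {dim}\n")
    for word, vec in vectors.items():
        stream.write(word + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")


def read_word2vec_text(stream):
    """Parse the word2vec text layout back into {word: [floats]}."""
    vocab_size, dim = map(int, stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        parts = stream.readline().rstrip("\n").split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}")
        vectors[word] = values
    return vectors


# Toy vectors, invented for illustration.
toy = {"new_york_": [0.1, -0.2, 0.3], "large_city_": [0.0, 0.5, -0.5]}

buffer = io.StringIO()
write_word2vec_text(toy, buffer)
buffer.seek(0)
restored = read_word2vec_text(buffer)

print(buffer.getvalue().splitlines()[0])
```

Because the layout is plain text, a model stored this way can be inspected with any text editor, at the cost of a larger file than a binary format would give.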